Embeddings and Pretrained networks¶

Demand for computing power is growing faster than predicted by Moore's Law¶

FLOPS: FLoating point OPerations per Second
What is a PetaFLOP (PFLOP)? - $10^{15}$ floating point operations per second
Pictures are taken from here; see also https://twitter.com/Thom_Wolf/status/1617168214139584512?t=VEi4R_mKxxSn0eczHdAr5A&s=03

The number of parameters is growing faster than the memory of the accelerators¶

The size of Transformer models grows by a factor of 240 every 2 years

BLOOM

Carbon Footprint of Neural Network Training¶

Copied from here: https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
Please note that this is from 2019.

The next illustration is based on the same publication, but is more approachable ;-)

The new reality in Data Science¶

Han Xiao, 2019 Founder and CEO of Jina AI

Examples in NLP¶

GloVe¶

GloVe dates from 2014. All relevant information can be found on the project site hosted at Stanford. The algorithm is rather simple:
"The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus."
The basic idea applies to Word2Vec as well as to GloVe:

This is done with so-called matrix factorization; the matrix is the co-occurrence matrix of words in documents.
The example below is a typical example of collaborative filtering. Matrix factorization became very popular in the recommender-system community due to the 1-Million-Dollar Netflix Prize.
At the start, each word (item, user) is given a random embedding vector. The scalar product (dot product) between the vectors is asked to reconstruct the content of the respective cell. The error, i.e. the difference between the scalar product and the actual content of the cell, is propagated into the embedding vectors, which are adapted accordingly.
The resulting embeddings are able to reconstruct the word co-occurrence matrix (or the ratings a user gave a certain movie).
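The idea above can be sketched in a few lines of NumPy. The co-occurrence matrix, the embedding dimension, and the learning rate below are invented toy values for illustration; this is plain SGD on the non-zero entries, not the actual GloVe training code (which additionally uses logarithms, bias terms, and a weighting function):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4x4 word-word co-occurrence matrix (invented values)
C = np.array([[0., 3., 1., 0.],
              [3., 0., 2., 1.],
              [1., 2., 0., 4.],
              [0., 1., 4., 0.]])

n, dim, lr = C.shape[0], 3, 0.05
W = rng.normal(scale=0.1, size=(n, dim))   # word embeddings
V = rng.normal(scale=0.1, size=(n, dim))   # context embeddings

for epoch in range(5000):
    for i in range(n):
        for j in range(n):
            if C[i, j] == 0:                  # train on non-zero entries only
                continue
            err = W[i] @ V[j] - C[i, j]       # dot product should match the cell
            # propagate the error back into both embedding vectors
            W[i] -= lr * err * V[j]
            V[j] -= lr * err * W[i]

print(np.round(W @ V.T, 1))   # reconstruction of the non-zero cells
```

After training, the dot products `W @ V.T` reproduce the non-zero cells of `C` closely, which is exactly the "embeddings reconstruct the co-occurrence matrix" property described above.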

Word2Vec¶

Word2Vec paper

CBOW: take the embeddings of the surrounding words and try to predict the masked (missing) word in the middle.

Skip-Gram: Take the embedding of the word in the middle and try to predict the words around it.
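The two training tasks differ only in how they slice a sentence into (input, target) pairs. A small sketch (window size and example sentence are invented for illustration; a real implementation would then train embeddings on these pairs):

```python
def training_pairs(tokens, window=2):
    """Build (input, target) pairs for CBOW and Skip-Gram."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j] for j in range(max(0, i - window),
                                            min(len(tokens), i + window + 1))
                   if j != i]
        # CBOW: the surrounding words predict the word in the middle
        cbow.append((context, center))
        # Skip-Gram: the middle word predicts each surrounding word
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram

sentence = "the cat sat on the mat".split()
cbow, skipgram = training_pairs(sentence)
print(cbow[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram[:2])  # [('the', 'cat'), ('the', 'sat')]
```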

FastText¶

FastText paper. A more approachable explanation can be found here:
While GloVe and Word2Vec work on the word level, FastText works on the character n-gram level. In this way it learns the internal structure of words. Thus, FastText has no out-of-vocabulary words (those not present during training) and is able to learn similarities via word stems.
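The n-gram decomposition itself is simple. A sketch of it (the n-gram range 3-6 matches the FastText paper's default; the word vector is then the sum of the vectors of these n-grams):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams as used by FastText; '<' and '>' mark word boundaries."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]          # FastText also keeps the whole word itself

print(char_ngrams("where", 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', '<where>']
```

Because an unseen word still decomposes into known n-grams, FastText can assign it a meaningful vector, which is exactly the out-of-vocabulary property described above.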

BERT embeddings¶

BERT is a classical transformer encoder. Instead of predicting the next word in a sentence (as done with recurrent neural networks such as LSTMs), it predicts the masked words (ca. 15%). The information from the words that are present is shared among all positions in the network.
The [CLS] token signals the beginning of a new sentence (its embedding is often used for sentence classification). The [MASK] token marks the places where the correct word has to be guessed; [PAD] just fills all input sentences to the same length. This is more efficient since sentences can be batched together.
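A toy sketch of this input preparation (the sentences are invented; real BERT preprocessing is more involved — of the ~15% selected tokens only 80% actually become [MASK], and tokenization is subword-based):

```python
import random

def prepare_batch(sentences, mask_prob=0.15, seed=0):
    """Toy sketch of BERT-style input preparation with [CLS], [MASK], [PAD]."""
    rng = random.Random(seed)
    max_len = max(len(s) for s in sentences) + 1           # +1 for [CLS]
    batch = []
    for tokens in sentences:
        toks = ["[CLS]"] + ["[MASK]" if rng.random() < mask_prob else t
                            for t in tokens]
        toks += ["[PAD]"] * (max_len - len(toks))          # pad to batch length
        batch.append(toks)
    return batch

batch = prepare_batch([["the", "cat", "sat"],
                       ["a", "dog", "barked", "loudly"]])
for row in batch:
    print(row)
```

All rows end up with the same length, which is what allows the sentences to be stacked into one batch tensor.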

Sentence-Transformers¶

The initial paper

The classification head on the left is used during training. For inference, the cosine similarity between the output embeddings of Sentence A and Sentence B is computed (right side).
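The inference step is just a cosine similarity between the two output embeddings. A minimal sketch (the 3-d vectors stand in for the model's real sentence embeddings, which are typically several hundred dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy embeddings standing in for Sentence A and Sentence B
emb_a = np.array([0.3, 0.8, -0.1])
emb_b = np.array([0.2, 0.9, 0.0])
print(round(cosine_similarity(emb_a, emb_b), 3))   # close to 1: similar sentences
```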

Triplet-Loss¶

In each training step, there is an anchor sentence and a positive example that is semantically equivalent to the anchor sentence. Moreover, there are negative examples that are just random sentences not similar to the anchor sentence.
The so-called triplet-loss function pushes the anchor and the positive embeddings closer to each other (cosine of 1) and pushes the anchor and the negative embeddings, as well as the positive and the negative embeddings, further away from each other (Euclidean or cosine distance).

\begin{equation*} L = \text{max}\left(\sum_i^N (f_i^a - f_i^p)^2 - \sum_i^N (f_i^a - f_i^n)^2 + \alpha,\; 0\right) \end{equation*}

where $\alpha$ is the margin, the amount by which negative examples have to be further away from the anchor than positive examples.
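A direct NumPy implementation of this standard squared-distance triplet loss (the embedding vectors and the margin below are invented toy values):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Squared-distance triplet loss with margin alpha."""
    d_pos = np.sum((f_a - f_p) ** 2)    # anchor-positive distance
    d_neg = np.sum((f_a - f_n) ** 2)    # anchor-negative distance
    return max(d_pos - d_neg + alpha, 0.0)

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.2])
print(triplet_loss(anchor, positive, negative))   # 0.0: triplet already satisfied
```

When the negative is already more than `alpha` further from the anchor than the positive, the loss is zero and the triplet contributes no gradient.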

multilingual sentence transformer¶

This is the publication on this ingenious idea.

The student model is XLM-R from Facebook.

Object Detection with the OpenImages v4 classes¶

The paper belonging to this project is this.
The code is a slightly modified copy of the one in the corresponding GitHub repository.

OpenImagesV4

Understanding the Backbone¶

$\mathbf{x}_g$, the global or scene-context features, as well as the regional features $\mathbf{x}_r$, are extracted from the different layers of an old-school VGG19 net.

Joint Visual-Semantic Space¶

The targets of the BiAM network are the GloVe embeddings of the OpenImages labels. The network is trained to output embedding vectors that are as close as possible to the embeddings of the labels.
This is also how the zero-shot learning takes place: new objects that are semantically similar to trained objects should get similar embeddings from BiAM. These embeddings are compared (cosine) with the GloVe embeddings of all possible objects (also those not within the training set).
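The matching step can be sketched as a nearest-neighbour search in embedding space. The 3-d "GloVe" label vectors and the network output below are invented toy values (real GloVe vectors are 50-300 dimensions):

```python
import numpy as np

# Invented 3-d stand-ins for the GloVe embeddings of the labels
label_embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.3, 0.1]),
    "car": np.array([0.0, 0.2, 0.9]),
}

def predict_label(image_embedding, labels):
    """Return the label whose embedding is nearest (cosine) to the network output."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(labels, key=lambda name: cos(image_embedding, labels[name]))

# A hypothetical network output vector that happens to lie close to "car"
out = np.array([0.1, 0.1, 0.8])
print(predict_label(out, label_embeddings))   # car
```

Because the candidate labels only need GloVe vectors, not training examples, unseen labels can be added to `label_embeddings` at inference time; this is the zero-shot mechanism described above.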

OCR¶

  • easyocr
  • tesseract

easyocr¶

easyocr on github
The easyocr algorithm dates from 2015. Considering that we use neural networks, this is rather 'old'. However, for a first check to see whether the whole pipeline could work with OCR, it's good enough. Perhaps it will even turn out that it's not the processing step in the pipeline that most needs improvement.
The model consists of several steps:

  1. a convolutional layer extracting feature maps (VGG is one of the most often used networks for this task)
  2. the vectors (columns) representing the different positions in the input image are fed to a Recurrent Neural Network (RNN, here an LSTM). This RNN predicts the most probable sequence of characters (and blanks) over the input feature maps.
  3. The Transcription Layer uses a 'dictionary approach' to clean the previous predictions and to return entire words.
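The per-position character predictions from step 2 are turned into a word by CTC-style decoding: collapse runs of the same character, then drop the blank symbol. A minimal sketch of the greedy variant (the blank symbol '-' and the input sequence are invented for illustration; easyocr's actual transcription layer additionally uses the dictionary approach mentioned in step 3):

```python
def ctc_greedy_decode(per_position_chars, blank="-"):
    """Collapse repeats, then remove blanks — CTC-style greedy decoding."""
    out = []
    prev = None
    for ch in per_position_chars:
        if ch != prev:            # collapse runs of the same symbol
            if ch != blank:       # drop the blank symbol
                out.append(ch)
        prev = ch
    return "".join(out)

# The RNN's most probable character at each horizontal position of the image
print(ctc_greedy_decode(list("hh-e-ll-llo-")))   # hello
```

The blank symbol is what lets the model output genuinely doubled letters: "ll" survives because a blank separates the two runs of 'l'.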

tesseract¶

Tesseract has been around for more than 15 years. It is open source and used in most non-commercial software with OCR capabilities. In 2016, an LSTM was added to its processing pipeline.
Good explanations and instructions (also for installing) can be found here.